Conversation
This unfortunately makes the test as expensive as loading the data with .get('column').
How do you use this downstream?
I have some datasets where certain transcripts are only computed partially. I could also catch the exception in my code, but it feels a bit counterintuitive if __contains__ succeeds but .get() fails.
Feels like in most cases you want to do something like:
if "col" in sample:
    # Do something with sample["col"]
else:
    # Do something else
We could do the more thorough check if include_partial_shards is set on the dataset?
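To make the suggestion above concrete, here is a minimal sketch of a membership test that only does the thorough check when include_partial_shards is set. The Sample class, its constructor arguments, and its internal attributes are hypothetical stand-ins, not the actual class from this PR; only __contains__, .get() and the include_partial_shards flag come from the discussion.

```python
class Sample:
    """Hypothetical stand-in for the sample class discussed in this thread."""

    def __init__(self, declared_columns, loaded_values, include_partial_shards=False):
        # Columns the dataset advertises (cheap to inspect).
        self._declared = set(declared_columns)
        # Columns actually available for this sample (assumed costlier to resolve).
        self._loaded = dict(loaded_values)
        self._include_partial_shards = include_partial_shards

    def __contains__(self, column):
        if column not in self._declared:
            return False
        if self._include_partial_shards:
            # Thorough check: only report columns that .get() can really return.
            return column in self._loaded
        # Cheap check: trust the declared schema.
        return True

    def get(self, column):
        if column not in self._loaded:
            raise KeyError(column)
        return self._loaded[column]


partial = Sample(["audio", "transcript"], {"audio": b"\x00"}, include_partial_shards=True)
value = partial.get("transcript") if "transcript" in partial else None
```

With the flag set, `"col" in sample` and `sample.get("col")` agree, so the if/else pattern from the earlier comment never raises. Without it, the test stays cheap but can still disagree with .get() on partially computed columns.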
self._filter_dfs[filter_name] = filter_df

rows_satisfying_filter = filter_df.sum().item()
@rashishhume do you maybe know what's going on here?
I deleted this since it caused a lot of log output at the start of my training jobs. Maybe it would make more sense to log this higher up.
if shard_subsample != 1:
    shard_list = rng.sample(shard_list, int(len(shard_list) * shard_subsample))

# TODO: Not sure if we want to drop the columns. I think previously we
We could apply the renaming in SQL as well, which would make it consistent with the non-SQL API.
I think most of the issues with duplicate fields were fixed recently when we added the select in this line:
df = scan_ipc(shard_path, glob=False).select(fields)
Do you remember which duplicate columns are causing you headaches?
It was language_whisper.txt I think, since this is contained in all the shards with Whisper transcripts.
Oh, interesting. We need to fix this then.
)

return exprs, pl.concat(row_merge).select(exprs)
def _common_dtype(col_name: str, a: pl.DataType, b: pl.DataType) -> pl.DataType:
My guess would be that this is about some shards having a null type because all the samples were None for a column?
Would concat with how="vertical_relaxed" help in this situation? (This would let Polars handle the coercion, hopefully in a sensible way.)
The issue is that some metrics have shards stored as both float16 and float64.
I'll try vertical_relaxed. (I remember some Polars merge mode not handling the issue I was facing, but I'm not 100% sure that was vertical_relaxed.)
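For reference, the kind of promotion rule being discussed can be sketched as follows. This is not the _common_dtype from the diff: it uses plain dtype-name strings instead of pl.DataType so that it runs without Polars, and the FLOAT_WIDTH table and the exact rules (null gives way to the other type, the wider float wins) are assumptions chosen to cover the two cases raised in this thread.

```python
# Hypothetical width table; dtype names are plain strings, not Polars types.
FLOAT_WIDTH = {"float16": 16, "float32": 32, "float64": 64}


def common_dtype(col_name: str, a: str, b: str) -> str:
    """Pick one dtype that both shard dtypes can be cast to."""
    if a == b:
        return a
    # A shard where every value was None carries no type information.
    if a == "null":
        return b
    if b == "null":
        return a
    # Mixed float widths (e.g. float16 and float64): keep the wider one.
    if a in FLOAT_WIDTH and b in FLOAT_WIDTH:
        return a if FLOAT_WIDTH[a] >= FLOAT_WIDTH[b] else b
    raise TypeError(f"column {col_name!r}: cannot reconcile {a} and {b}")


resolved = common_dtype("metric", "float16", "float64")
```

If vertical_relaxed turns out to handle both cases, a hand-written rule like this would not be needed.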
@jpc it seems like you already have better solutions for some of these issues. These were mostly fixes I added when I was in the process of getting SFT to run. I'm fine with not merging this.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ultiple SQL expressions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WSsample.__contains__ now looks at the shards loaded, which prevents some errors when working with partially computed features.